160 research outputs found

    Performance of random forest when SNPs are in linkage disequilibrium

    Background: Single nucleotide polymorphisms (SNPs) may be correlated due to linkage disequilibrium (LD). Association studies look for both direct and indirect associations with disease loci. In a Random Forest (RF) analysis, correlation between a true risk SNP and SNPs in LD may lead to diminished variable importance for the true risk SNP. One approach to address this problem is to select SNPs in linkage equilibrium (LE) for analysis. Here, we explore alternative methods for dealing with SNPs in LD: change the tree-building algorithm by building each tree in an RF only with SNPs in LE, modify the importance measure (IM), and use haplotypes instead of SNPs to build a RF. Results: We evaluated the performance of our alternative methods by simulation of a spectrum of complex genetics models. When a haplotype rather than an individual SNP is the risk factor, we find that the original Random Forest method performed on SNPs provides good performance. When individual, genotyped SNPs are the risk factors, we find that the stronger the genetic effect, the stronger the effect LD has on the performance of the original RF. A revised importance measure used with the original RF is relatively robust to LD among SNPs; this revised importance measure used with the revised RF is sometimes inflated. Overall, we find that the revised importance measure used with the original RF is the best choice when the genetic model and the number of SNPs in LD with risk SNPs are unknown. For the haplotype-based method, under a multiplicative heterogeneity model, we observed a decrease in the performance of RF with increasing LD among the SNPs in the haplotype. Conclusion: Our results suggest that by strategically revising the Random Forest method tree-building or importance measure calculation, power can increase when LD exists between SNPs. We conclude that the revised Random Forest method performed on SNPs offers an advantage of not requiring genotype phase, making it a viable tool for use in the context of thousands of SNPs, such as candidate gene studies and follow-up of top candidates from genome wide association studies.
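
As a rough illustration of what "SNPs in LD" means numerically, the sketch below computes the standard LD coefficient D and the squared correlation r2 between two biallelic loci. It is not taken from the paper: the function name `ld_r2` and the haplotype counts are hypothetical, and the calculation assumes phased haplotype counts are available.

```python
def ld_r2(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r2) for two biallelic loci from phased haplotype counts.

    n_AB is the number of haplotypes carrying allele A at locus 1 and
    allele B at locus 2, and so on for the other three combinations.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total            # frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / total    # frequency of allele A
    p_B = (n_AB + n_aB) / total    # frequency of allele B
    d = p_AB - p_A * p_B           # D = 0 under linkage equilibrium
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("r2 is undefined when either locus is monomorphic")
    return d, d * d / denom


# Hypothetical haplotype counts, for illustration only.
d, r2 = ld_r2(n_AB=40, n_Ab=10, n_aB=10, n_ab=40)
print(f"D = {d:.3f}, r2 = {r2:.3f}")
```

A pair of SNPs with r2 near 0 is close to linkage equilibrium; values near 1 indicate that the two SNPs carry largely redundant information for a tree-based learner.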

    SNPInterForest: A new method for detecting epistatic interactions

    Background: Multiple genetic factors and their interactive effects are speculated to contribute to complex diseases. Detecting such genetic interactive effects, i.e., epistatic interactions, however, remains a significant challenge in large-scale association studies. Results: We have developed a new method, named SNPInterForest, for identifying epistatic interactions by extending an ensemble learning technique called random forest. Random forest is a predictive method that has been proposed for use in discovering the single-nucleotide polymorphisms (SNPs) that are most predictive of disease status in association studies. However, it is less sensitive to SNPs with little marginal effect. Furthermore, it does not natively expose information on interaction patterns of susceptibility SNPs. We extended the random forest framework to overcome the above limitations by (i) modifying the construction of the random forest and (ii) implementing a procedure for extracting interaction patterns from the constructed random forest. The performance of the proposed method was evaluated on simulated data under a wide spectrum of disease models. SNPInterForest identified pure epistatic interactions with high precision and was still capable of concurrently identifying multiple interactions in the presence of genetic heterogeneity. It was also applied to real GWAS data on rheumatoid arthritis from the Wellcome Trust Case Control Consortium (WTCCC), and novel potential interactions were reported. Conclusions: SNPInterForest, offering an efficient means to detect epistatic interactions without statistical analyses, is promising for practical use as a way to reveal the epistatic interactions involved in common complex diseases.

    Use of principal components to aggregate rare variants in case-control and family-based association studies in the presence of multiple covariates

    Rare variants may help to explain some of the missing heritability of complex diseases. Technological advances in next-generation sequencing give us the opportunity to test this hypothesis. We propose two new methods (one for case-control studies and one for family-based studies) that combine aggregated rare variants and common variants located within a region through principal components analysis and allow for covariate adjustment. We analyzed 200 replicates consisting of 209 case subjects and 488 control subjects and compared the results to weight-based and step-up aggregation methods. The principal components and collapsing method showed an association between the gene FLT1 and the quantitative trait Q1 (P < 10^-30) in a fraction of the computation time of the other methods. The proposed family-based test gave inconclusive results. The two methods provide a fast way to analyze rare and common variants simultaneously at the gene level while adjusting for covariates. However, further evaluation of the statistical efficiency of this approach is warranted.

    A Two-Stage Random Forest-Based Pathway Analysis Method

    Pathway analysis provides a powerful approach for identifying the joint effect on disease of genes grouped into biologically based pathways. Pathway analysis is also an attractive approach for a secondary analysis of genome-wide association study (GWAS) data that may still yield new results from these valuable datasets. Most current pathway analysis methods focus on testing the cumulative main effects of genes in a pathway. However, for complex diseases, gene-gene interactions are expected to play a critical role in disease etiology. We extended a random forest-based method for pathway analysis by incorporating a two-stage design. We used simulations to verify that the proposed method has the correct type I error rates. We also used simulations to show that, in the presence of gene-gene interactions, the method is more powerful than the original random forest-based pathway approach and the set-based test implemented in PLINK. Finally, we applied the method to a breast cancer GWAS dataset and a lung cancer GWAS dataset and identified pathways that have implications for breast and lung cancers.

    A polymorphic variant of the insulin-like growth factor 1 (IGF-1) receptor correlates with male longevity in the Italian population: a genetic study and evaluation of circulating IGF-1 from the "Treviso Longeva (TRELONG)" study

    Background: Attenuation of insulin-like growth factor 1 (IGF-1) signaling has been associated with extended lifespan in simple metazoan organisms and in rodents. In humans, the IGF-1 level shows an age-related modulation, with a lower concentration in the elderly, depending on hormonal factors and on genetic factors affecting the IGF-1 receptor gene (IGF-1R). Methods: In an elderly population from North-eastern Italy (n = 668 subjects, age range 70–106 years) we investigated the IGF-1R polymorphism G3174A (rs2229765) and the plasma concentration of free IGF-1. Frequency distributions were compared using the χ² goodness-of-fit test, and means were compared by one-way analysis of variance (ANOVA); multiple regression analysis was performed using JMP7 for SAS software (SAS Institute, USA). The limit of significance for genetic and biochemical comparisons was set at α = 0.05. Results: Males showed an age-related increase in the A-allele of rs2229765 and a change in the plasma level of IGF-1, which dropped significantly after 85 years of age (85+ group). In the male 85+ group, A/A homozygous subjects had the lowest plasma IGF-1 level. We found no clear correlation between rs2229765 genotype and IGF-1 in the females. Conclusion: These findings confirm the importance of the rs2229765 minor allele as a genetic predisposing factor for longevity in Italy, where a sex-specific pattern for IGF-1 attenuation with ageing was found.

    Conditional variable importance for random forests

    Random forests are becoming increasingly popular in many scientific fields because they can cope with "small n large p" problems, complex interactions and even highly correlated predictor variables. Their variable importance measures have recently been suggested as screening tools for, e.g., gene expression studies. However, these variable importance measures show a bias towards correlated predictor variables. We identify two mechanisms responsible for this finding: (i) A preference for the selection of correlated predictors in the tree building process and (ii) an additional advantage for correlated predictor variables induced by the unconditional permutation scheme that is employed in the computation of the variable importance measure. Based on these considerations we develop a new, conditional permutation scheme for the computation of the variable importance measure. The resulting conditional variable importance is shown to reflect the true impact of each predictor variable more reliably than the original marginal approach.
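
The difference between unconditional (marginal) and conditional permutation can be sketched in a few lines. The toy below is a heavily simplified stand-in, not the authors' algorithm: instead of a fitted forest it uses a fixed hypothetical predictor that averages a causal variable `z` and a correlated proxy `x`, and it conditions by shuffling `x` only within strata of `z`. All names and the simulated data are assumptions made for illustration.

```python
import random


def mse(predict, rows, y):
    """Mean squared error of `predict` over the rows."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)


def permuted_rows(rows, col, strata_col, rng):
    """Copy `rows` with column `col` shuffled.

    If `strata_col` is None the shuffle is unconditional; otherwise values
    are shuffled only among rows sharing the same value in `strata_col`.
    """
    new_rows = [list(r) for r in rows]
    groups = {}
    for i, r in enumerate(rows):
        key = r[strata_col] if strata_col is not None else None
        groups.setdefault(key, []).append(i)
    for idx in groups.values():
        values = [rows[i][col] for i in idx]
        rng.shuffle(values)
        for i, v in zip(idx, values):
            new_rows[i][col] = v
    return new_rows


def importance(predict, rows, y, col, strata_col, rng):
    """Increase in MSE after permuting column `col`."""
    shuffled = permuted_rows(rows, col, strata_col, rng)
    return mse(predict, shuffled, y) - mse(predict, rows, y)


# Simulated data: z drives the outcome, x merely agrees with z 90% of the time.
rng = random.Random(0)
rows, y = [], []
for _ in range(2000):
    z = rng.randint(0, 1)
    x = z if rng.random() < 0.9 else 1 - z
    rows.append([z, x])
    y.append(z)


def predict(row):
    # Hypothetical ensemble in which half the "trees" split on the proxy x.
    return 0.5 * row[0] + 0.5 * row[1]


marginal = importance(predict, rows, y, col=1, strata_col=None, rng=rng)
conditional = importance(predict, rows, y, col=1, strata_col=0, rng=rng)
print(f"marginal importance of x:    {marginal:.4f}")
print(f"conditional importance of x: {conditional:.4f}")
```

In this toy the proxy `x` receives a clearly positive marginal importance although it has no effect beyond `z`, while its importance vanishes once the permutation is restricted to strata of `z`.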

    Exceptionally low likelihood of Alzheimer's dementia in APOE2 homozygotes from a 5,000-person neuropathological study

    Although each additional copy of the apolipoprotein E4 (APOE4) allele is associated with a higher risk of Alzheimer's dementia and the APOE2 allele is associated with a lower risk, it is not yet known whether APOE2 homozygotes have a particularly low risk. We generated Alzheimer's dementia odds ratios and other findings in more than 5,000 clinically and neuropathologically characterized Alzheimer's dementia cases and controls. APOE2/2 was associated with a low Alzheimer's dementia odds ratio compared to APOE2/3 and 3/3, and an exceptionally low odds ratio compared to APOE4/4. The impact of APOE2 and APOE4 gene dose was significantly greater in the neuropathologically confirmed group than in more than 24,000 neuropathologically unconfirmed cases and controls. Finding and targeting the factors by which APOE and its variants influence Alzheimer's disease could have a major impact on the understanding, treatment and prevention of the disease.

    A Functional Polymorphism in Renalase (Glu37Asp) Is Associated with Cardiac Hypertrophy, Dysfunction, and Ischemia: Data from the Heart and Soul Study

    Renalase is a soluble enzyme that metabolizes circulating catecholamines. A common missense polymorphism in the flavin-adenine dinucleotide-binding domain of human renalase (Glu37Asp) has recently been described. The association of this polymorphism with cardiac structure, function, and ischemia has not previously been reported. We genotyped the rs2296545 single-nucleotide polymorphism (Glu37Asp) in 590 Caucasian individuals and performed resting and stress echocardiography. Logistic regression was used to examine the associations of the Glu37Asp polymorphism (C allele) with cardiac hypertrophy (LV mass > 100 g/m²), systolic dysfunction (LVEF < 50%), diastolic dysfunction, poor treadmill exercise capacity (METS < 5) and inducible ischemia. Compared with the 406 participants who had GG or CG genotypes, the 184 participants with the CC genotype had increased odds of left ventricular hypertrophy (OR = 1.43; 95% CI 0.99-2.06), systolic dysfunction (OR = 1.72; 95% CI 1.01-2.94), diastolic dysfunction (OR = 1.75; 95% CI 1.05-2.93), poor exercise capacity (OR = 1.61; 95% CI 1.05-2.47), and inducible ischemia (OR = 1.49; 95% CI 0.99-2.24). The Glu37Asp (CC genotype) caused a 24-fold decrease in affinity for NADH and a 2.3-fold reduction in maximal renalase enzymatic activity. A functional missense polymorphism in renalase (Glu37Asp) is associated with cardiac hypertrophy, ventricular dysfunction, poor exercise capacity, and inducible ischemia in persons with stable coronary artery disease. Further studies investigating the therapeutic implications of this polymorphism should be considered.
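
For readers less familiar with odds ratios, the sketch below shows how an unadjusted odds ratio and a Wald 95% confidence interval are obtained from a 2x2 table, and how to read an interval against the null value of 1. This is a simplification: the study used logistic regression, the abstract does not report the underlying counts, and the counts used here are hypothetical. Only the five intervals in `reported` are taken from the abstract above.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


def excludes_one(lower, upper):
    """True if the interval does not contain the null value OR = 1."""
    return lower > 1.0 or upper < 1.0


# Hypothetical counts, for illustration only (not from the study).
or_, lo, hi = odds_ratio_ci(a=20, b=80, c=10, d=90)
print(f"OR = {or_:.2f}, 95% CI {lo:.2f}-{hi:.2f}")

# Intervals as reported in the abstract above.
reported = {
    "left ventricular hypertrophy": (0.99, 2.06),
    "systolic dysfunction": (1.01, 2.94),
    "diastolic dysfunction": (1.05, 2.93),
    "poor exercise capacity": (1.05, 2.47),
    "inducible ischemia": (0.99, 2.24),
}
for outcome, (lower, upper) in reported.items():
    print(f"{outcome}: CI excludes 1 -> {excludes_one(lower, upper)}")
```

Read this way, the reported intervals for left ventricular hypertrophy and inducible ischemia just include 1, whereas the other three exclude it.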

    Bias in random forest variable importance measures: Illustrations, sources and a solution

    BACKGROUND: Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories. RESULTS: Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on one hand, and effects induced by bootstrap sampling with replacement on the other hand. CONCLUSION: We propose to employ an alternative implementation of random forests, that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories. 
The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. Therefore, the suggested method can be applied straightforwardly by scientists in bioinformatics research.
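
The first mechanism named above, biased variable selection in individual trees, can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' R code: it uses an exhaustive CART-style search over threshold splits with Gini impurity, treats categories as ordered, and compares two predictors that are both pure noise. All function names are hypothetical.

```python
import random


def gini(positives, n):
    """Gini impurity of a binary node with `positives` ones among n cases."""
    if n == 0:
        return 0.0
    p = positives / n
    return 2 * p * (1 - p)


def best_gini_gain(x, y):
    """Largest impurity decrease over all threshold splits x <= c."""
    n = len(y)
    parent = gini(sum(y), n)
    best = 0.0
    for c in sorted(set(x))[:-1]:  # the largest value yields no split
        left = [t for v, t in zip(x, y) if v <= c]
        right = [t for v, t in zip(x, y) if v > c]
        children = (len(left) * gini(sum(left), len(left))
                    + len(right) * gini(sum(right), len(right))) / n
        best = max(best, parent - children)
    return best


def mean_best_gain(n_categories, n=100, reps=200, seed=0):
    """Average best gain of an uninformative predictor with n_categories levels."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        y = [rng.randint(0, 1) for _ in range(n)]
        x = [rng.randrange(n_categories) for _ in range(n)]  # independent of y
        total += best_gini_gain(x, y)
    return total / reps


gain_binary = mean_best_gain(2)
gain_ten = mean_best_gain(10)
print(f"mean best gain, 2 categories:  {gain_binary:.4f}")
print(f"mean best gain, 10 categories: {gain_ten:.4f}")
```

Although neither predictor carries any signal, the one with more categories offers more candidate cut points and therefore tends to achieve a larger best gain by chance, which is why it would be selected more often in a greedy split search.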